Abstract

We analyze and evaluate an online gradient descent algorithm with adaptive per-coordinate adjustment
of learning rates. Our algorithm can be thought of as an online version of batch gradient descent
with a diagonal preconditioner. This approach leads to regret bounds that are stronger than those of
standard online gradient descent for general online convex optimization problems. Experimentally, we
show that our algorithm is competitive with state-of-the-art algorithms for large-scale machine learning
problems.

1 Introduction

In the past few years, online algorithms have emerged as state-of-the-art techniques for solving large-scale
machine learning problems [13, 16]. In addition to their simplicity and generality, online algorithms are
natural choices for problems where new data is constantly arriving and rapid adaptation is important.

Compared to the study of convex optimization in the batch (offline) setting, the study of online convex
optimization is relatively new. In light of this, it is not surprising that performance-improving techniques
that are well known and widely used in the batch setting do not yet have online analogues. In particular,
convergence rates in the batch setting can often be dramatically improved through the use of preconditioning.
Yet, the online convex optimization literature provides no comparable method for improving regret (the online
analogue of convergence rates).

A simple and effective form of preconditioning is to re-parameterize the loss function so that its magnitude
is the same in all coordinate directions. Without this modification, a batch algorithm such as gradient descent
will tend to take excessively small steps along some axes and to oscillate back and forth along others, slowing
convergence. In the online setting, this rescaling cannot be done up front because the loss functions vary
over time and are not known in advance. As a result, when existing no-regret algorithms for online convex
optimization are applied to machine learning problems, they tend to overfit the data with respect to certain
features and underfit with respect to others (we give a concrete example of this behavior in §2).

We show that this problem can be overcome in a principled way by using online gradient descent^1
with adaptive, per-coordinate learning rates. Our algorithm comes with worst-case regret bounds (see
Theorem 3) that are never worse than those of standard online gradient descent, and are much better
when the magnitude of the gradients varies greatly across coordinates (this structure is common in
large-scale problems of practical interest). Extending this approach, we give improved bounds for generalized
notions of strong convexity, bounds in terms of the variance of cost functions, and bounds on adaptive regret
(regret against a drifting comparator). Experimentally, we show that our algorithm dramatically outperforms
standard online gradient descent on real-world problems, and is competitive with state-of-the-art algorithms
for online binary classification.

^1 When loss functions are drawn IID, as when online gradient descent is applied to a batch learning problem,
the term stochastic gradient descent is often used.
1.1 Background and notation

In an online convex optimization problem, we are given as input a closed, convex feasible set $F$. On each
round $t$, we must pick a point $x_t \in F$. We then incur loss $f_t(x_t)$, where $f_t$ is a convex function. At the end
of round $t$, the loss function $f_t$ is revealed to us. Our regret at the end of $T$ rounds is the difference between
our total loss and that of the best fixed $x \in F$ in hindsight, that is,

$$\mathrm{Regret} = \sum_{t=1}^{T} f_t(x_t) - \min_{x \in F} \left\{ \sum_{t=1}^{T} f_t(x) \right\}.$$

Sequential prediction using a generalized linear model is an important special case of online convex
optimization. In this case, each $x_t \in \mathbb{R}^n$ is a vector of weights, where $x_{t,i}$ is the weight assigned to feature $i$
on round $t$. On round $t$, the algorithm makes a prediction $p_t(x_t) = \ell(x_t \cdot \theta_t)$, where $\theta_t \in \mathbb{R}^n$ is a feature vector
and $\ell$ is a fixed link function (e.g., $\ell(\alpha) = \frac{1}{1 + \exp(-\alpha)}$ for logistic regression, $\ell(\alpha) = \alpha$ for linear regression).

The algorithm then incurs loss that is some function of the prediction $p_t$ and the label $y_t \in \mathbb{R}$ of the example.
For example, in logistic regression the loss is $f_t(x) = -y_t \log p_t(x) - (1 - y_t) \log(1 - p_t(x))$, and in least squares
linear regression the loss is $f_t(x) = (y_t - p_t(x))^2$. In both of these examples, it can be shown that $f_t$ is a
convex function of $x$.

We are particularly interested in online gradient descent and generalizations thereof. Online gradient
descent chooses $x_1$ arbitrarily, and thereafter plays

$$x_{t+1} = P(x_t - \eta_t g_t) \tag{1}$$

where $\eta_1, \eta_2, \ldots, \eta_T$ is a sequence of learning rates, $g_t \in \partial f_t(x_t)$ is a subgradient of $f_t$ at $x_t$, and
$P(x) = \arg\min_{y \in F} \{\|x - y\|\}$ is the projection operator, where $\|\cdot\|$ is the L2 norm. When the learning rates are
chosen appropriately, online gradient descent obtains regret $O(GD\sqrt{T})$, where $D = \max_{x,y \in F} \{\|x - y\|\}$ is
the diameter of the feasible set and $G = \max_t \{\|g_t\|\}$ is the maximum norm of the gradients. Thus, as
$T \to \infty$, the average loss of the points $x_1, x_2, \ldots, x_T$ selected by online gradient descent is as good as that
of any fixed point $x \in F$ in the feasible set. It is perhaps surprising that this performance guarantee holds
for any sequence of loss functions, and in particular that the bound holds even if the sequence is chosen
adversarially.
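
As a minimal sketch of update (1), the following Python code runs projected online gradient descent on a
box-shaped feasible set, where the projection reduces to coordinate-wise clipping. The helper names
(`project_box`, `online_gradient_descent`), the toy loss $f_t(x) = (x - 0.5)^2$, and the rate schedule
$\eta_t = 1/\sqrt{t}$ are assumptions made for illustration and are not taken from the paper.

```python
import math

def project_box(x, lo, hi):
    """Euclidean projection onto the box [lo, hi]^n (coordinate-wise clipping)."""
    return [min(max(xi, lo), hi) for xi in x]

def online_gradient_descent(grad_fns, x1, learning_rates, lo, hi):
    """Apply x_{t+1} = P(x_t - eta_t * g_t) and return the points played."""
    x = list(x1)
    played = []
    for grad_fn, eta in zip(grad_fns, learning_rates):
        played.append(list(x))
        g = grad_fn(x)  # (sub)gradient of f_t at the point just played
        x = project_box([xi - eta * gi for xi, gi in zip(x, g)], lo, hi)
    return played

# Toy problem (illustrative assumption): f_t(x) = (x - 0.5)^2 on F = [0, 1].
T = 100
grad_fns = [lambda x: [2.0 * (x[0] - 0.5)]] * T
etas = [1.0 / math.sqrt(t) for t in range(1, T + 1)]
points = online_gradient_descent(grad_fns, [0.0], etas, 0.0, 1.0)
```

On this stationary toy problem the iterates settle at the minimizer 0.5 after a few rounds; the regret
guarantee in the text is stronger, since it also covers adversarially chosen sequences.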

2 Motivations 

It is well known that batch gradient descent performs poorly in the presence of so-called ravines, surfaces
that curve more steeply in some directions than in others [15]. In this section we give examples showing that
when the slope of the loss function or the size of the feasible set varies widely across coordinates, gradient 
descent incurs high regret in the online setting. These observations motivate the use of per-coordinate 
learning rates (which can be thought of as an adaptive diagonal preconditioner). 

2.1 A motivating application 

Consider the problem of trying to predict the probability that a user will click on an ad when it is shown 
alongside search results for a particular query, using a generalized linear model. For simplicity, imagine there 
is only one ad, and we wish to predict its click-through rate on many different queries. On a large search 
engine, a popular query will occur orders of magnitude more often than a rare query. For queries that occur 
rarely, it is necessary to use a relatively large learning rate in order for the associated feature weights to 
move significantly away from zero. But for popular queries, the use of such a large learning rate will cause 
the feature weights to oscillate wildly, and so the predictions made by the algorithm will be unstable. Thus, 
gradient descent with a global learning rate cannot simultaneously perform well on common queries and on 
rare ones. Because rare queries are more numerous than common ones, performing poorly on either category 
leads to substantial regret. 
2.2 Tradeoffs in one dimension

We first consider gradient descent in one dimension, with a fixed learning rate $\eta$ (later we generalize to
arbitrary non-increasing sequences of learning rates).

If $\eta$ is too large, the algorithm may oscillate about the optimal point and thereby incur high regret. As a
simple example, suppose the feasible set is $[0, D]$, and the loss function on each round is $f_t(x) = G|x - \epsilon|$, for
some small positive $\epsilon$. Then $\nabla f_t(x) = -G$ if $x < \epsilon$ and $\nabla f_t(x) = G$ if $x > \epsilon$. It is easy to verify that if the
algorithm plays $x_1 = 0$ initially, it will play $x_t = 0$ on odd rounds and $x_t = G\eta$ on even rounds, assuming
$\epsilon < G\eta < D$. Thus, after $T$ rounds the algorithm incurs total loss $\frac{T}{2} G\epsilon + \frac{T}{2} G(G\eta - \epsilon) = \frac{T}{2} G^2 \eta$. Always
playing $x = \epsilon$ would incur zero loss, so the regret is $\frac{T}{2} G^2 \eta$.

On the other hand, if $\eta$ is too small then $x_t$ may stay close to zero long after the data indicates that a
larger $x$ would incur smaller loss. For example, suppose $f_t(x) = -Gx$ always. Then $x_t = \min\{D, (t - 1)G\eta\}$.
For the first $\frac{D}{2G\eta}$ rounds, $x_t \le \frac{D}{2}$ and therefore our per-round regret relative to the comparator $x = D$ is at
least $\frac{GD}{2}$ on these rounds. Thus, overall regret is at least $\frac{GD}{2} \min\left\{T, \frac{D}{2G\eta}\right\} = \frac{D^2}{4\eta}$, assuming that $\frac{D}{2G\eta} \le T$.
Thus, for any choice of $\eta$ there exists a problem where

$$\max\left\{ \frac{T}{2} G^2 \eta, \; \frac{D^2}{4\eta} \right\} \le \mathrm{Regret} \le \frac{T}{2} G^2 \eta + \frac{D^2}{2\eta}$$

where the upper bound is adapted from Zinkevich [17]. Thus, by setting $\eta = \frac{D}{G\sqrt{T}}$ (which minimizes the
upper bound) we minimize worst-case regret up to a constant factor. Note that this choice of $\eta$ satisfies the
constraints $\frac{D}{2T} \le G\eta \le D$, as was assumed earlier.

The fact that the optimal choice of $\eta$ is proportional to $\frac{D}{G}$ captures a fundamental tradeoff. When the
feasible set is large and the gradients are small, we must use a larger learning rate in order to be competitive 
with points in the far extremes of the feasible set. On the other hand, when the feasible set is small and 
the gradients are large, we must use a smaller learning rate in order to avoid the possibility of oscillating 
between the extremes and performing poorly relative to points in the center. 
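
The two failure modes can be reproduced numerically. The sketch below simulates fixed-rate gradient descent
on $[0, D]$ for both one-dimensional examples; the function name `run_gd_1d` and the specific constants
($G = D = 1$, $T = 1000$, $\epsilon = 10^{-3}$, and the two learning rates) are illustrative assumptions, chosen so
that the conditions $\epsilon < G\eta < D$ and $D/(2G\eta) \le T$ from the text hold.

```python
def run_gd_1d(grad, eta, D, T):
    """Fixed-rate gradient descent on [0, D], starting from x_1 = 0; returns points played."""
    x, played = 0.0, []
    for _ in range(T):
        played.append(x)
        x = min(max(x - eta * grad(x), 0.0), D)  # step, then project onto [0, D]
    return played

G, D, T, eps = 1.0, 1.0, 1000, 1e-3

# Rate too large on f_t(x) = G|x - eps|: iterates alternate between 0 and G*eta.
eta_big = 0.5
xs = run_gd_1d(lambda x: -G if x < eps else G, eta_big, D, T)
regret_oscillate = sum(G * abs(x - eps) for x in xs)  # comparator x = eps has zero loss

# Rate too small on f_t(x) = -G*x: iterates move toward D too slowly.
eta_small = 1e-3
xs = run_gd_1d(lambda x: -G, eta_small, D, T)
regret_slow = sum(G * (D - x) for x in xs)  # regret relative to comparator x = D
```

With these constants the first regret equals $\frac{T}{2} G^2 \eta$ and the second exceeds $\frac{D^2}{4\eta}$, matching the
two lower bounds derived above.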

Because the relevant values of D and G will in general be different for different coordinates, a gradient 
descent algorithm that uses the same learning rate for all coordinates is doomed to either underfit on 
some coordinates or oscillate on others. To handle this, we must use different learning rates for different 
coordinates. Furthermore, because the magnitude G of the gradients is not known in advance and can change 
over time, we must incorporate it into our choice of learning rate in an online fashion. 

2.3 A bad example for global learning rates 

We now exhibit a class of online convex optimization problems where the use of a coordinate-independent 
learning rate forces regret to grow at an asymptotically larger rate than with a per-coordinate learning rate. 
This result is summarized in the following theorem. 

Theorem 1. There exists a family of online convex optimization problems, parameterized by their lengths
(number of rounds $T$), where gradient descent with a non-increasing global learning rate incurs regret at least
$\Omega(T^{2/3})$, whereas gradient descent with an appropriate per-coordinate learning rate has regret $O(\sqrt{T})$.

The $\Omega(T^{2/3})$ lower bound stated in Theorem 1 does not contradict the previously stated $O(GD\sqrt{T})$ upper
bound on the regret of online gradient descent, because in this family of problems $D = T^{1/6}$ (and $G = 1$).

Proof of Theorem 1. To prove this theorem, we interleave instances of the two classes of one-dimensional
subproblem discussed in §2.2, setting $G = 1$ and setting the feasible set to $[0, 1]$. We have one subproblem of
the first type, lasting for $T_0$ rounds, followed by $C$ subproblems of the second type, each lasting $T_1$ rounds.
Each subproblem is assigned its own coordinate. Formally, the loss function is

$$f_t(x_t) = \begin{cases} |x_{t,1} - \epsilon| & \text{if } t \le T_0 \\ -x_{t,j} & \text{if } t > T_0, \text{ where } j = 1 + \left\lceil \frac{t - T_0}{T_1} \right\rceil. \end{cases}$$
On each round, only one component of the gradient vector is non-zero. Thus, running gradient descent
with global learning rate $\eta$ is equivalent to running a separate copy of gradient descent on each subproblem,
where each copy uses learning rate $\eta$. Moreover, overall regret is simply the sum of the regret on each
subproblem. Thus, by the lower bounds stated in §2.2, regret is at least

$$\frac{T_0}{2} \eta + C \cdot \frac{1}{2} \min\left\{ T_1, \frac{1}{2\eta} \right\}$$

(note that $G = D = 1$).

If we set $C = T_1 = T_0^{1/3}$, this expression is $\Omega(T^{2/3})$. To see this, first note that if $T_1 < \frac{1}{2\eta}$ then the second
term is already $\Omega(T_0^{2/3}) = \Omega(T^{2/3})$ (note that $T = T_0 + C T_1 < 2T_0$). Otherwise, a simple minimization over
$\eta$ shows that the sum is $\Omega(T_0^{2/3})$. Because regret on the first subproblem is an increasing function of $\eta$, and
regret on all later subproblems is a decreasing function of $\eta$, the same $\Omega(T^{2/3})$ lower bound holds for any
non-increasing sequence $\eta_1, \eta_2, \ldots, \eta_T$ of per-round learning rates. Thus, we have proved the first part of the
theorem.

Now consider the alternative of letting the learning rate for each coordinate vary independently. On a
one-dimensional subproblem with feasible set $[0, 1]$ and gradients of magnitude at most 1, gradient descent
using learning rate $\frac{1}{\sqrt{s}}$ on round $s$ of the subproblem obtains regret $O(\sqrt{S})$ on a subproblem of length $S$
[17]. Thus, if we ran an independent copy of this algorithm on each coordinate, we would obtain regret
$O(\sqrt{T_0} + C\sqrt{T_1}) = O(\sqrt{T_0}) = O(\sqrt{T})$, which completes the proof. □
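
The scaling in the first half of the proof can be checked numerically. The sketch below evaluates the
constant-rate lower bound $\frac{T_0}{2}\eta + \frac{C}{2}\min\{T_1, \frac{1}{2\eta}\}$ with $C = T_1 = T_0^{1/3}$ (rounded to an integer)
over a grid of learning rates and compares the smallest value with $T_0^{2/3}$. The function names, the grid
$[10^{-6}, 1]$, and the chosen values of $T_0$ are assumptions for illustration; the code only evaluates the
bound for constant rates and is not a simulation of the full construction.

```python
def global_rate_lower_bound(T0, eta):
    """Lower bound on regret for a constant global rate eta, with C = T1 = T0^(1/3), G = D = 1."""
    C = T1 = round(T0 ** (1.0 / 3.0))
    first = 0.5 * T0 * eta                         # oscillation on the first subproblem
    later = C * 0.5 * min(T1, 1.0 / (2.0 * eta))   # slow progress on the C later subproblems
    return first + later

def min_over_rates(T0, grid_size=2000):
    """Smallest value of the lower bound over a logarithmic grid of rates in [1e-6, 1]."""
    etas = [10.0 ** (-6.0 + 6.0 * k / (grid_size - 1)) for k in range(grid_size)]
    return min(global_rate_lower_bound(T0, eta) for eta in etas)

# Ratio of the best achievable lower bound to T0^(2/3) for growing problem sizes.
ratios = {T0: min_over_rates(T0) / T0 ** (2.0 / 3.0) for T0 in (10 ** 3, 10 ** 6, 10 ** 9)}
```

Under these assumptions the ratio stays bounded away from zero as $T_0$ grows, which is what the
$\Omega(T^{2/3})$ claim requires for constant learning rates.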



3 Improved Regret Bounds using Per-Coordinate Learning Rates 

Zinkevich [17] proved bounds on the regret of online gradient descent (which chooses $x_t$ according to
Equation (1)). Building on his analysis, we improve these bounds by adjusting the learning rates on a
per-coordinate basis. Specifically, we obtain these bounds by constructing the vector $y_t$ by

$$y_{t,i} = x_{t,i} - g_{t,i} \eta_{t,i} \tag{2}$$

where $\eta_t$ is a vector of learning rates, one for each coordinate. We then play $x_{t+1} = P(y_t)$. We prove bounds
for feasible sets defined by axis-aligned constraints, $F = \times_{i=1}^{n} [a_i, b_i]$. Many machine learning problems can
be solved using feasible sets of this form, as our experiments demonstrate.^2



3.1 A better global learning rate 

We first give an improved regret bound for gradient descent with a global (coordinate-independent) learning 
rate. In the next subsection, we make use of this improved bound in order to prove the desired bounds on 
the regret of gradient descent with a per-coordinate learning rate. 

Zinkevich [17] showed that if we run gradient descent with a non-increasing sequence $\eta_1, \eta_2, \ldots, \eta_T$ of
learning rates, regret is bounded by

$$B(\eta_1, \eta_2, \ldots, \eta_T) = D^2 \frac{1}{2\eta_T} + \frac{1}{2} \sum_{t=1}^{T} \eta_t \|g_t\|^2. \tag{3}$$

To guard against the worst case, it is natural to choose our sequence of learning rates so as to minimize
this bound. Doing so is problematic, however, because in the online setting the gradients $g_1, g_2, \ldots, g_T$ are
not known in advance. Perhaps surprisingly, we can come within a factor of $\sqrt{2}$ of the optimal bound even
without having this information up front, as the following theorem shows.

^2 Our techniques can be extended to arbitrary feasible sets using a somewhat different algorithm, but the proofs
are significantly more technical [14].
Theorem 2. Setting $\eta_t = \frac{D}{\sqrt{2 \sum_{s=1}^{t} \|g_s\|^2}}$ yields a regret bound of $D \sqrt{2 \sum_{t=1}^{T} \|g_t\|^2} = \sqrt{2} \cdot R_{\min}$, where
$R_{\min} = \min_{\eta_1 \ge \eta_2 \ge \cdots \ge \eta_T} \{ B(\eta_1, \eta_2, \ldots, \eta_T) \}$.

Proof. Plugging the formula for $\eta_t$ into (3), and then using Lemma 1 (below), we see that regret is bounded
by

$$\frac{D}{2} \sqrt{2 \sum_{t=1}^{T} \|g_t\|^2} + \frac{1}{2} \sum_{t=1}^{T} \frac{D \|g_t\|^2}{\sqrt{2 \sum_{s=1}^{t} \|g_s\|^2}} \le D \sqrt{2 \sum_{t=1}^{T} \|g_t\|^2}.$$

We now compute $R_{\min}$. First, note that if $\eta_t > \eta_{t+1}$ for some $t$ then we could reduce the second term in
$B(\{\eta_t\})$ by making $\eta_t$ smaller. Because the sequence is constrained to be non-increasing, it follows that the
bound is minimized using a constant learning rate $\eta$. A simple minimization then shows that it is optimal
to set $\eta = \frac{D}{\sqrt{\sum_{t=1}^{T} \|g_t\|^2}}$, which gives regret $D \sqrt{\sum_{t=1}^{T} \|g_t\|^2}$. □



A related result appears in [1], giving improved bounds in the case of strongly convex functions but worse
constants than ours in the case of linear functions.
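
The relationship between the adaptive rate of Theorem 2 and the hindsight-optimal constant rate can be
checked on synthetic data. In the sketch below, `zinkevich_bound` evaluates the bound (3) and
`adaptive_rates` computes $\eta_t$ from the squared gradient norms seen so far; both function names, the value
$D = 3$, and the randomly drawn squared norms are assumptions for illustration.

```python
import math
import random

def zinkevich_bound(D, etas, sq_norms):
    """B(eta_1..eta_T) from (3): D^2 / (2 eta_T) + (1/2) * sum_t eta_t * ||g_t||^2."""
    return D * D / (2.0 * etas[-1]) + 0.5 * sum(e * s for e, s in zip(etas, sq_norms))

def adaptive_rates(D, sq_norms):
    """eta_t = D / sqrt(2 * sum_{s<=t} ||g_s||^2), using only gradients seen so far."""
    etas, running = [], 0.0
    for s in sq_norms:
        running += s
        etas.append(D / math.sqrt(2.0 * running))
    return etas

random.seed(0)
D = 3.0
sq_norms = [random.uniform(0.1, 4.0) for _ in range(500)]  # synthetic values of ||g_t||^2
total = sum(sq_norms)

bound_adaptive = zinkevich_bound(D, adaptive_rates(D, sq_norms), sq_norms)
bound_theorem2 = D * math.sqrt(2.0 * total)
best_constant = D / math.sqrt(total)  # optimal constant rate, known only in hindsight
r_min = zinkevich_bound(D, [best_constant] * len(sq_norms), sq_norms)
```

Here `bound_adaptive` never exceeds `bound_theorem2`, and `bound_theorem2` equals $\sqrt{2}$ times `r_min`,
as the theorem states.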

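Lemma 1. For any non-negative real numbers $x_1, x_2, \ldots, x_n$,

$$\sum_{i=1}^{n} \frac{x_i}{\sqrt{\sum_{j=1}^{i} x_j}} \le 2 \sqrt{\sum_{i=1}^{n} x_i}.$$

Proof. The lemma is clearly true for $n = 1$. Fix some $n$, and assume the lemma holds for $n - 1$. Thus,

$$\sum_{i=1}^{n} \frac{x_i}{\sqrt{\sum_{j=1}^{i} x_j}} \le 2 \sqrt{\sum_{i=1}^{n-1} x_i} + \frac{x_n}{\sqrt{\sum_{i=1}^{n} x_i}} = 2\sqrt{Z - x} + \frac{x}{\sqrt{Z}}$$

where we define $Z = \sum_{i=1}^{n} x_i$ and $x = x_n$. The derivative of the right hand side with respect to $x$ is
$-\frac{1}{\sqrt{Z - x}} + \frac{1}{\sqrt{Z}}$, which is negative for $x > 0$. Thus, subject to the constraint $x \ge 0$, the right hand side is
maximized at $x = 0$, and is therefore at most $2\sqrt{Z}$. □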
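
Lemma 1 is easy to sanity-check numerically. The sketch below evaluates the left-hand side on random
non-negative sequences and records the largest ratio to the right-hand side. The function name
`lemma1_lhs`, the exponential sampling distribution, and the convention of skipping terms whose prefix
sum is zero are assumptions made for illustration.

```python
import math
import random

def lemma1_lhs(xs):
    """sum_i x_i / sqrt(x_1 + ... + x_i), skipping terms whose prefix sum is zero."""
    total, prefix = 0.0, 0.0
    for x in xs:
        prefix += x
        if prefix > 0.0:
            total += x / math.sqrt(prefix)
    return total

random.seed(1)
trials = [[random.expovariate(1.0) for _ in range(random.randint(1, 200))]
          for _ in range(1000)]
# Largest observed value of LHS / RHS; the lemma says this never exceeds 1.
worst_ratio = max(lemma1_lhs(xs) / (2.0 * math.sqrt(sum(xs))) for xs in trials)
```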



3.2 A per-coordinate learning rate 

We can improve the above bound by running, for each coordinate, a separate copy of gradient descent
that uses the learning rate given in the previous section (see Algorithm 1). Specifically, we use the update of
Equation (2) with $\eta_{t,i} = \frac{D_i}{\sqrt{2 \sum_{s=1}^{t} g_{s,i}^2}}$, where $D_i = b_i - a_i$ is the diameter of the feasible set along coordinate $i$.



The following theorem makes three important points about the performance of Algorithm 1: (i) its
regret is bounded by a sum of per-coordinate bounds, each of the same form as (3); (ii) the algorithm's
choice of $\eta_{t,i}$ gives a regret bound that is only a factor of $\sqrt{2}$ worse than if the bound had been optimized
knowing $g_1, g_2, \ldots, g_T$ in advance; and (iii) the regret bound of Algorithm 1 is never worse than the bound
for global learning rates stated in Theorem 2. Furthermore, as illustrated in Theorem 1, the per-coordinate
bound can be better by an arbitrarily large factor if the magnitude of the gradients varies widely across
coordinates.
Algorithm 1 Per-coordinate gradient descent

Input: feasible set $F = \times_{i=1}^{n} [a_i, b_i]$
Initialize $x_1 = 0$ and $D_i = b_i - a_i$.
for $t = 1$ to $T$ do
    Play the point $x_t$.
    Receive loss function $f_t$, set $g_t = \nabla f_t(x_t)$.
    Let $y_{t+1}$ be a vector whose $i$th component is $y_{t+1,i} = x_{t,i} - \eta_{t,i} g_{t,i}$, where $\eta_{t,i} = \frac{D_i}{\sqrt{2 \sum_{s=1}^{t} g_{s,i}^2}}$.
    Set $x_{t+1} = P(y_{t+1})$.
end for
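
A minimal Python sketch of Algorithm 1 follows. It is an illustration under stated assumptions rather than
the authors' implementation: the function name `per_coordinate_gd` is hypothetical, the starting point is 0
clipped into the box, and a coordinate with zero gradient is left unchanged on that round (which avoids
a division by zero before the coordinate has seen any non-zero gradient).

```python
import math

def per_coordinate_gd(grad_fns, a, b):
    """Sketch of Algorithm 1 on the box F = [a_1, b_1] x ... x [a_n, b_n].

    grad_fns[t](x) returns a (sub)gradient of f_t at x. Returns the points played.
    """
    n = len(a)
    D = [b[i] - a[i] for i in range(n)]                 # per-coordinate diameters D_i
    x = [min(max(0.0, a[i]), b[i]) for i in range(n)]   # assumed start: 0 clipped into F
    sum_sq = [0.0] * n                                  # running sums of g_{s,i}^2
    played = []
    for grad_fn in grad_fns:
        played.append(list(x))
        g = grad_fn(x)
        for i in range(n):
            if g[i] == 0.0:
                continue                                # zero gradient: leave coordinate as is
            sum_sq[i] += g[i] * g[i]
            eta = D[i] / math.sqrt(2.0 * sum_sq[i])     # eta_{t,i} from the update rule
            x[i] = min(max(x[i] - eta * g[i], a[i]), b[i])  # coordinate-wise projection
    return played

# Linear losses whose gradient is 100 times larger along coordinate 0 than coordinate 1.
played = per_coordinate_gd([lambda x: [-1.0, -0.01]] * 5, [0.0, 0.0], [1.0, 1.0])
```

In this toy run both coordinates follow the same trajectory and reach the upper bound 1 by the third
round, even though their gradient magnitudes differ by a factor of 100: the per-coordinate rate makes
the update invariant to the scale of each coordinate's gradients.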



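Theorem 3. Let $F = \times_{i=1}^{n} [a_i, b_i]$. Then, Algorithm 1 has regret bounded by $\sum_{i=1}^{n} B_i(\{\eta_{t,i}\})$, where

$$B_i(\{\eta_{t,i}\}) = D_i^2 \frac{1}{2\eta_{T,i}} + \frac{1}{2} \sum_{t=1}^{T} \eta_{t,i} \, g_{t,i}^2.$$

Setting $\eta_{t,i} = \frac{D_i}{\sqrt{2 \sum_{s=1}^{t} g_{s,i}^2}}$, the bound becomes

$$\sum_{i=1}^{n} D_i \sqrt{2 \sum_{t=1}^{T} g_{t,i}^2} = \sqrt{2} \sum_{i=1}^{n} R^i_{\min} \tag{4}$$

where $R^i_{\min} = \min_{\eta_{1,i} \ge \eta_{2,i} \ge \cdots \ge \eta_{T,i}} \{ B_i(\{\eta_{t,i}\}) \}$. This is a stronger guarantee than Theorem 2 in that

$$\sum_{i=1}^{n} D_i \sqrt{2 \sum_{t=1}^{T} g_{t,i}^2} \le D \sqrt{2 \sum_{t=1}^{T} \|g_t\|^2} \tag{5}$$

where $D = \sqrt{\sum_{i=1}^{n} D_i^2}$ is the diameter of the set $F$.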

Proof. Zinkevich [17] showed that, so long as our algorithm only makes use of $\nabla f_t(x_t)$, we may assume without
loss of generality that $f_t$ is linear, and therefore $f_t(x) = g_t \cdot x$ for all $x \in F$. If $F$ is a hypercube, then the
projection operator $P(x)$ simply projects each coordinate $x_i$ independently onto the interval $[a_i, b_i]$. Thus, in
this special case, we can think of each coordinate $i$ as solving a separate online convex optimization problem
where the loss function on round $t$ is $g_{t,i} \cdot x$. Thus, Equation (3) implies that for each $i$,

$$\sum_{t=1}^{T} g_{t,i} x_{t,i} - \min_{y \in [a_i, b_i]} \left\{ \sum_{t=1}^{T} g_{t,i} \, y \right\} \le B_i(\{\eta_{t,i}\}).$$

Summing this bound over all $i$, we get the regret bound

$$\sum_{t=1}^{T} g_t \cdot x_t - \min_{y \in F} \left\{ \sum_{t=1}^{T} g_t \cdot y \right\} \le \sum_{i=1}^{n} B_i(\{\eta_{t,i}\}). \tag{6}$$

Applying Theorem 2 to each one-dimensional problem, we get $B_i(\{\eta_{t,i}\}) \le D_i \sqrt{2 \sum_{t=1}^{T} g_{t,i}^2} = \sqrt{2} \cdot R^i_{\min}$ for all $i$.

To prove inequality (5), let $\vec{D} \in \mathbb{R}^n$ be a vector whose $i$th component is $D_i$, and let $\vec{g} \in \mathbb{R}^n$ be a vector
whose $i$th component is $\sqrt{2 \sum_{t=1}^{T} g_{t,i}^2}$, so the left-hand side of (5) can be written as $\vec{D} \cdot \vec{g}$. Then, using the
Cauchy-Schwarz inequality,

$$\vec{D} \cdot \vec{g} \le \|\vec{D}\| \cdot \|\vec{g}\| = \sqrt{\sum_{i=1}^{n} D_i^2} \; \sqrt{2 \sum_{i=1}^{n} \sum_{t=1}^{T} g_{t,i}^2}.$$

The right hand side simplifies to $D \sqrt{2 \sum_{t=1}^{T} \|g_t\|^2}$. □
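
Inequality (5) can be illustrated on synthetic sparse gradients of the kind that arise when some features
are common and others rare. In the sketch below, the function names, the unit diameters, and the rule that
coordinate $i$ is active with probability $1/(i+1)$ are assumptions for illustration only.

```python
import math
import random

def per_coordinate_bound(D, grads):
    """Left-hand side of (5): sum_i D_i * sqrt(2 * sum_t g_{t,i}^2)."""
    return sum(D[i] * math.sqrt(2.0 * sum(g[i] ** 2 for g in grads))
               for i in range(len(D)))

def global_bound(D, grads):
    """Right-hand side of (5): ||D|| * sqrt(2 * sum_t ||g_t||^2)."""
    diameter = math.sqrt(sum(d * d for d in D))
    return diameter * math.sqrt(2.0 * sum(sum(gi * gi for gi in g) for g in grads))

random.seed(2)
n, T = 50, 200
D = [1.0] * n
# Synthetic sparse gradients: coordinate i is non-zero with probability 1 / (i + 1).
grads = [[random.choice((-1.0, 1.0)) if random.random() < 1.0 / (i + 1) else 0.0
          for i in range(n)] for _ in range(T)]
lhs, rhs = per_coordinate_bound(D, grads), global_bound(D, grads)
```

The gap between `lhs` and `rhs` grows as the gradient magnitudes become more uneven across coordinates;
when all coordinates have identical gradient statistics, the two sides coincide.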
4 Additional Improved Regret Bounds



The approach of bounding overall regret in terms of the sum of regret on a set of one-dimensional problems 
can be used to obtain additional regret bounds that improve over those of previous work, in the special case 
where the feasible set is a hypercube. The key observation is captured in the following lemma. 

Lemma 2. Consider an online optimization problem with feasible set $F = \times_{i=1}^{n} [a_i, b_i]$ and loss functions
$f_1, f_2, \ldots, f_T$. For each $t$, let $\ell_t(x) = \sum_{i=1}^{n} \ell_{t,i}(x_i)$ be a lower bound on $f_t$ (i.e., $f_t(x) \ge \ell_t(x)$ for all
$x \in F$). Further suppose that $f_t(x_t) = \ell_t(x_t)$ for all $t$, where $\{x_t\}$ is the sequence of points played by an
online algorithm. Consider the composite online algorithm formed by running a 1-dimensional algorithm
independently for each coordinate $i$ on feasible set $[a_i, b_i] \subseteq \mathbb{R}$, with loss function $\ell_{t,i}$ on round $t$. Let

$$R = \sum_{t=1}^{T} f_t(x_t) - \min_{x \in F} \left\{ \sum_{t=1}^{T} f_t(x) \right\}$$

be the total regret of the composite algorithm, and let

$$R_i = \sum_{t=1}^{T} \ell_{t,i}(x_{t,i}) - \min_{x_i \in [a_i, b_i]} \left\{ \sum_{t=1}^{T} \ell_{t,i}(x_i) \right\}$$

be the regret incurred by the algorithm responsible for choosing the $i$th coordinate. Then $R \le \sum_{i=1}^{n} R_i$.
Proof. Because $f_t(x) \ge \ell_t(x)$ for all $x$, and $f_t(x_t) = \ell_t(x_t)$,

$$R \le \sum_{t=1}^{T} \ell_t(x_t) - \min_{x \in F} \left\{ \sum_{t=1}^{T} \ell_t(x) \right\} = \sum_{i=1}^{n} \left( \sum_{t=1}^{T} \ell_{t,i}(x_{t,i}) - \min_{x_i \in [a_i, b_i]} \left\{ \sum_{t=1}^{T} \ell_{t,i}(x_i) \right\} \right) = \sum_{i=1}^{n} R_i.$$

□

Importantly, for arbitrary convex functions, we can always construct such independent lower bounds by
choosing $\ell_t(x) = f_t(x_t) + \nabla f_t(x_t) \cdot (x - x_t)$, as long as we add a "bias" coordinate where $a_i = b_i = 1$. A similar
observation was originally used by Zinkevich [17] to show that any algorithm for online linear optimization
can be used for online convex optimization. We used this fact in the proof of Theorem 3, where we only
analyzed the linear case.

This simple lemma has powerful ramifications. We now discuss several improved guarantees that can be
obtained by applying it to known online algorithms. For simplicity, when stating these bounds we assume
that the feasible set is $F = [0, 1]^n$ and that the gradients of the loss functions are componentwise upper
bounded by 1 (that is, $|(\nabla f_t(x_t))_i| \le 1$ for all $t$ and $i$).

4.1 More general notions of strong convexity 

A function $f$ is $H$-strongly convex if, for all $x, y \in F$, it holds that $f(y) \ge f(x) + \nabla f(x) \cdot (y - x) + \frac{H}{2} \|y - x\|^2$.
Strongly convex functions arise, for example, when solving learning problems subject to L2 regularization.
Bartlett et al. [1] give an online convex optimization algorithm whose regret is

$$O\left( n \cdot \min\left\{ \sqrt{T}, \frac{1}{H} \log T \right\} \right)$$

where $H$ is the largest constant such that each $f_t$ is $H$-strongly convex. We can generalize the concept
of strong convexity as follows. We say that $f$ is strongly convex with respect to the vector $H$ if, for all
$x, y \in F$, $f(y) \ge f(x) + \nabla f(x) \cdot (y - x) + \sum_{i=1}^{n} \frac{H_i}{2} (y_i - x_i)^2$. Suppose we run the algorithm of Bartlett et
al. independently for each coordinate, feeding back $\ell_{t,i}(y_i) = \frac{1}{n} f_t(x_t) + \nabla f_t(x_t)_i \cdot (y_i - x_{t,i}) + \frac{H_i}{2} (y_i - x_{t,i})^2$
to the algorithm responsible for choosing coordinate $i$ (we can always choose $H_i \ge H$). Applying Lemma 2,
we obtain a regret bound

$$O\left( \sum_{i=1}^{n} \min\left\{ \sqrt{T}, \frac{1}{H_i} \log T \right\} \right).$$

This bound is never worse than the previous one, and is better if the degree of strong convexity differs
substantially across different coordinates (e.g., if using different L2 regularization parameters for different
classes of features).



4.2 Tighter bounds in terms of variance 

Hazan and Kale [9] give a bound on gradient descent's regret in terms of the variance of the sequence of
gradients. Specifically, their algorithm has regret $O(\sqrt{nV})$, where $V = \sum_{t=1}^{T} \|g_t - \mu\|^2$ and $\mu = \frac{1}{T} \sum_{t=1}^{T} g_t$,
where $g_t = \nabla f_t(x_t)$.

By running a separate copy of their algorithm on each coordinate, we can instead obtain a bound of
$O\left(\sum_{i=1}^{n} \sqrt{V_i}\right)$, where $V_i = \sum_{t=1}^{T} (g_{t,i} - \mu_i)^2$.

To compare the bounds, let $\vec{v} \in \mathbb{R}^n$ be a vector whose $i$th component is $\sqrt{V_i}$, and let $\vec{1} \in \mathbb{R}^n$ be a vector
whose components are all 1. Note that $\|\vec{v}\| = \sqrt{\sum_{i=1}^{n} V_i} = \sqrt{V}$. Using the Cauchy-Schwarz inequality,

$$\sum_{i=1}^{n} \sqrt{V_i} = \vec{1} \cdot \vec{v} \le \|\vec{1}\| \cdot \|\vec{v}\| = \sqrt{nV}.$$

Thus, the bound obtained by running separate copies of the algorithm for each coordinate is never worse
than the original bound, and is substantially better when the variance $V_i$ varies greatly across coordinates.
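
The comparison between the two variance-based bounds can be sketched numerically. The code below computes
$V$ and the per-coordinate terms $V_i$ from a gradient sequence and compares $\sum_i \sqrt{V_i}$ with $\sqrt{nV}$.
The function name `variance_terms` and the synthetic Gaussian gradients, whose scale decays geometrically
across coordinates, are assumptions chosen to make the gap visible.

```python
import math
import random

def variance_terms(grads):
    """Return (V, [V_1..V_n]) with V_i = sum_t (g_{t,i} - mu_i)^2 and V = sum_i V_i."""
    T, n = len(grads), len(grads[0])
    mu = [sum(g[i] for g in grads) / T for i in range(n)]       # mean gradient
    V_i = [sum((g[i] - mu[i]) ** 2 for g in grads) for i in range(n)]
    return sum(V_i), V_i

random.seed(3)
n, T = 20, 300
scales = [10.0 ** (-i / 5.0) for i in range(n)]   # per-coordinate gradient scale
grads = [[random.gauss(0.0, s) for s in scales] for _ in range(T)]

V, V_i = variance_terms(grads)
per_coordinate = sum(math.sqrt(v) for v in V_i)   # sum_i sqrt(V_i)
global_term = math.sqrt(n * V)                    # sqrt(n * V)
```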



4.3 Adaptive regret 

One weakness of standard regret bounds like those stated so far is that they bound performance only in terms 
of the static optimal solution over all T rounds. In a non-stationary environment, it is desirable to obtain 
stronger guarantees. For example, suppose the feasible set is $[0, 1]$, $f_t(x) = x$ for the first $\frac{T}{2}$ rounds and
$f_t(x) = -x$ thereafter. Then an algorithm that plays $x_t = 0$ for all $t$ has zero regret, yet its loss on the final $\frac{T}{2}$
rounds is $\frac{T}{2}$ worse than if it had played the point $x = 1$ for those rounds. Indeed, standard regret-minimizing
algorithms fail to adapt in simple examples such as this.

Hazan and Seshadhri [10] define adaptive regret as the maximum, over all intervals $[T_0, T_1]$, of the regret
$\sum_{t=T_0}^{T_1} f_t(x_t) - \min_{x \in F} \left\{ \sum_{t=T_0}^{T_1} f_t(x) \right\}$ incurred over that interval. For $H$-strongly convex functions, their
algorithm achieves adaptive regret $O\left(\frac{n}{H} \log^2 T\right)$.

By running an independent copy of their algorithm on each coordinate, we can obtain the following
guarantee. Consider an arbitrary sequence $Z = (z_1, z_2, \ldots, z_T)$ of points in $F$, and let $R_Z = \sum_{t=1}^{T} f_t(x_t) - f_t(z_t)$
be the regret relative to that sequence. Holding $H$ constant for simplicity, the adaptive regret bound
just stated implies that the algorithm of Hazan and Seshadhri [10] obtains $R_Z = O(n(N + 1) \log^2 T)$, where
$N$ is the number of values of $t$ for which $z_t \ne z_{t+1}$ (this follows by summing adaptive regret over the $N + 1$
intervals where $z_t$ is constant). Using separate copies for each coordinate, we instead obtain

$$R_Z = O\left( \sum_{i=1}^{n} (N_i + 1) \log^2 T \right)$$

where $N_i$ is the number of values of $t$ where $z_{t,i} \ne z_{t+1,i}$. This bound is never worse than the previous one,
and is better when some coordinates of the vectors in $Z$ change more frequently than others.
Table 1: Hinge loss and accuracy in the online setting on binary classification problems.

Data           Global   Per-Coord   CW      PA

Hinge loss
BOOKS          0.606    0.545       0.871   0.672
DVD            0.576    0.529       0.851   0.637
ELECTRONICS    0.509    0.452       0.802   0.555
KITCHEN        0.470    0.419       0.787   0.520
NEWS           0.171    0.140       0.512   0.245
RCV1           0.076    0.070       0.542   0.094

Fraction of mistakes
BOOKS          0.259    0.211       0.215   0.254
DVD            0.238    0.208       0.203   0.240
ELECTRONICS    0.209    0.175       0.177   0.194
KITCHEN        0.180    0.151       0.153   0.175
NEWS           0.064    0.050       0.054   0.060
RCV1           0.027    0.025       0.039   0.034


This provides an improved performance guarantee when the environment is stationary with respect to 
some coordinates and non-stationary with respect to others. This could happen, for example, if the effect of 
certain features (e.g., features for advertisers in certain business sectors) changes over time, but the effect of 
other features remains constant. 

5 Experimental Evaluation 

In this section, we evaluate gradient descent with per-coordinate learning rates experimentally on several 
machine learning problems. 

5.1 Online binary classification 

We first compare the performance of online gradient descent with that of two recent algorithms for text
classification: the Passive-Aggressive (PA) algorithm [3], and confidence-weighted (CW) linear classification [7].
The latter algorithm has been demonstrated to have state-of-the-art performance on large real-world
problems [13].

We used four sentiment classification data sets (Books, Dvd, Electronics, and Kitchen), available from [6],
each with 1000 positive examples and 1000 negative examples,^3 as well as the scaled versions of the
rcv1.binary (677,399 examples) and news20.binary (19,996 examples) data sets from LIBSVM [3]. For
each data set, we shuffled the examples and then ran each algorithm for one pass over the data, computing
the loss on each example before training on it.
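
The evaluation protocol just described (score each example before training on it, in a single pass) can be
sketched as follows. The function `progressive_validation` and the toy running-mean learner are
hypothetical names and components introduced for illustration; the algorithms actually evaluated are the
ones named in the text, and the examples are assumed to have been shuffled beforehand.

```python
def progressive_validation(examples, predict, update, loss):
    """One pass over the data: compute the loss on each example, then train on it."""
    total = 0.0
    for features, label in examples:
        total += loss(predict(features), label)  # evaluate before the learner sees the label
        update(features, label)                  # then learn from the example
    return total / len(examples)

# Toy learner (illustrative assumption): predicts the running mean of the labels seen so far.
state = {"sum": 0.0, "count": 0}

def predict(_features):
    return state["sum"] / state["count"] if state["count"] else 0.0

def update(_features, label):
    state["sum"] += label
    state["count"] += 1

examples = [((), 1.0)] * 4  # four examples with empty features and label 1
avg_loss = progressive_validation(examples, predict, update, lambda p, y: (p - y) ** 2)
```

Because every prediction is made before the corresponding update, the averaged loss measures online
performance rather than fit to data the learner has already seen.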

For the online gradient descent algorithms, we set $F = [-R, R]^n$ for $R = 100$. We found that the learning
rate suggested by Theorem 3 was too aggressive in practice when the feasible set is large (note that it moves
a feature's weight to the maximum value the first time it sees a non-zero gradient for that feature). In
order to improve performance, we did some parameter tuning. For Algorithm 1 (Per-Coord), we scaled the
learning rate formula by a factor of $0.6/R$, and for the global learning rate (Global) we scaled it by $0.2/R$.
We estimate the diameter $D$ in the global learning rate formula online, based on the number of attributes
seen so far. For CW, we found that the parameters $\phi = 1.0$ and $a = 1.0$ worked well in practice.

Table 1 presents average hinge loss and the fraction of classification mistakes for each algorithm. The
Global and Per-Coord algorithms are designed to minimize hinge loss, and at this objective the Per-Coord
algorithm consistently wins. CW and PA are designed to maximize classification accuracy, and on this

^3 We used the features provided in processed_acl.tar.gz, and scaled each vector of counts to unit length.
Table 2: Additive regret incurred in the online setting, for logistic regression on various ads data sets.

Data set            Global   Per-Coord
HEALTH INSURANCE    0.215    0.028
TELEFONICA          0.261    0.034
LIFE INSURANCE      0.225    0.029
FOREX               0.148    0.012
BUSINESS CARDS      0.158    0.025
AUTO INSURANCE      0.232    0.032
CREDIT REPORT       0.231    0.032
CREDIT CARDS        0.263    0.050
SHOE                0.171    0.026


objective Per-Coord and CW are the best algorithms. The fact that the classification accuracy of Per-Coord 
is comparable to that of a state-of-the-art binary classification algorithm is impressive given the former 
algorithm's generality (i.e., its applicability to arbitrary online convex optimization problems such as online 
shortest paths). 

5.2 Large-scale logistic regression 

We collected data from a large search engine,^4 consisting of random samples of queries that contained a
particular phrase, for example "auto insurance". Each data set has a few million examples. We transformed
this data into an online logistic regression problem with a feature vector $\theta_t$ for each ad impression, using
features based on the text of the ad and the query. The target label $\ell_t$ is 1 if the ad was clicked, and -1
otherwise. The loss function $f_t$ is the sum of the logistic loss, $\log(1 + \exp(-\ell_t \, x_t \cdot \theta_t))$, and an L2 regularization
term.

We compare gradient descent using the global learning rate from §3.1 with gradient descent using the
per-coordinate rate given in §3.2. We scaled the formulas given in those sections by 0.1; this improved
performance for both algorithms but did not change the relative comparison. The feasible set was $[-1, 1]^n$.

Table 2 shows the regret incurred by the two algorithms on various data sets. Gradient descent with
a per-coordinate learning rate consistently obtains an order of magnitude lower regret than with a global
learning rate. To calculate regret, we computed the static optimal loss $\min_{x \in F} \left\{ \sum_{t=1}^{T} f_t(x) \right\}$ by running
our per-coordinate algorithm through the data many times until convergence.



The use of different learning rates for different coordinates has been investigated extensively in the neural
network community. There the focus has been on empirical performance in the batch setting, and a large
number of algorithms have been developed; see for example [12]. These algorithms are not designed to
perform well in an adversarial online setting, and for many of them it is straightforward to construct examples
where the algorithm incurs high regret.

More recently, Hsu et al. [11] gave an algorithm for choosing per-coordinate learning rates for gradient
descent, derived asymptotic rates of convergence in the batch setting, and presented a number of positive
experimental results.

Confidence-weighted linear classification and AROW [5] are similar to our algorithm in that they 
make different-sized adjustments for different coordinates, and in that common features are updated less 
aggressively than rare ones. Unlike our algorithm, these algorithms apply only to classification problems 
and not to general online convex optimization, and the guarantees are in the form of mistake bounds rather 
than regret bounds. 

^4 No user-specific data was used in these experiments.

In concurrent work [14], we generalize the results of this paper to handle arbitrary feasible sets and
a matrix (rather than a vector) of learning rate parameters. Similar theoretical results were obtained
independently by Duchi et al. [5].